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Abstract 

We give a majorization algorithm for weighted least squares low-rank matrix approx¬ 
imation, a.k.a. principal component analysis. The loss function has one non-negative 
weight for each squared residual. A quadratic programming method is used to compute 
optimal rank-one weights for the majorization scheme. 
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1 Introduction 


In a rank-p matrix approximation problem we have an n x m matrix X and we want to find 
the closest n x m matrix X of rank-p in the least squares sense (Eckart and Young (1936), 
Keller (1962)). The problem is usually parametrized as A" = AB' where A is n x p and B 
is m x p. If closeness is measured by the usual Frobenius norm, i.e. the unweighted sum of 
squares, then the solution is given by the first p singular values of X and their corresponding 
left and right singular vectors. Generalizations to more general Euclidean norms are discussed, 
with some historical references, in De Leeuw (1984) (and undoubtedly in hundreds of other 
places). In this paper we generalized to simple weighted least squares in which the square of 
each residual Xij — x^ is weighted by a non-negative weight uy,-. Thus the loss function we 
must minimize is 

n m p 

(7(A, B) 22 — Y1 a isbjs) ■ 

i= 1 j =1 s=l 

This general problem was perhaps first tackled by Gabriel and Zamir (1979), using an 
alternating least squares algorithm of the form (with superscript k indicating iteration count) 


A (k '> = argmin a(A,B {k) ), 

A 

B (k+l ) = argmin a^A^jB). 

B 

Both subproblems of this criss-cross regression algorithm are weighted linear least squares 
problems, with a possibly non-sparse matrix of weights. We will go a different route, by 
concentrating on simplifying the weights. 


2 Majorization using Weights 


Majorization algorithms are discussed in general in De Leeuw (1994) and Heiser (1995), and 
more recently and more extensively in Lange (2016) and De Leeuw (2016a). We assume the 
basic majorization approach is familiar and go straight away to the details. 


Our first job is to construct a majorization scheme. 

Suppose A^ and B^ are solutions after k iterations. Define 


a (*0 . = 
x ij ’ 


x> LYh 

S=1 


Then 


and consequently 


v 

I 

s=l 


= X ij ] + (Y a i* b 


■ X {k) ) 

JS )•) 


s =1 


*= 1 3 = 1 


Wij \%ij 


r n m p 

-Zij^CY a isbjs- X ij ] )+Y Y w ij(Y aisbjs-?^ 2 
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If Cij > Wij and c %3 > 0 we have 


<7 


n m p 

%( k )^ 


(A, B) < a(A {k \ B {k) )-2^2^2w ij (x ij -x\j ) )(Y^ a ia b ja -x\f) + ^2^2 Cy(^ a is bjs-x l3 


(fc )\2 


i= 1 3 = 1 


Define 


and 


S=1 


(fc) , 

rp \ / • -[ /v» . , rp ' 

' ij ‘ V**'*? ^ij )i 

Cij 


i=l j =1 s=l 


s(fc)> 


/ 7 ( fc ) y( fc ) , !%q r .. _ A k )\ 

n ij ■— x ij ' ij )• 


C, 


V 


Note that Jiff is a convex combination of xff and x t j. Now 

n m 

<t(A, B) < <t{ol, c ij{ r ij > } 2 + J2J2 c v(J2 a isbjs - h 

i= 1 

In the majorization step we minimize 


n m p 

( fc h2 „ /V „ h h( fc )\2 

ij ) 

i= 1 j =1 i =1 j =1 s =1 


n m p 

ri(A, B, A<*>, B^) := E E «j(E a ‘‘ b i- ~ S*?) 2 . 

i=lj=l s=l 

which is a matrix approximation problem similar to our original problem, with adjusted 
weights and an adjusted target. In the next iteration we define a new target, solve the matrix 
approximation problem again, and so on. 

It seems that our developments so far have not gained us much. Instead of one matrix 
approximation problem we now have a sequence of them, all with the same structure. The 
gain will have to come from the choice of the c t]1 which we will try to construct in such a 
way that the matrix approximation problems in each iteration become more simple. 

2.1 Rank-one Weights 

First suppose c tJ > 0 and there are u > 0 and v > 0 such that Cy = UiVj. Now 

(fc) /-T(fc) 

9ij : = yftwi h ij ’ 

and define new variables cq s = y 7 Uia is and j3j S = y/v]b ]S . Then 

n m p n m p 

J2 J2 Cii(Z) a ^ b 3s - hff = J2J2(J2 a i S (5j S - g^f 

i=l j=l s=l i=l j=l s=l 

This shows that the matrix approximation problems that must be solved in each majorization 
step are unweighted, which means they can be solved rapidly using partial singular value 
decomposition. 
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2.2 The Symmetric Case 

If X is symmetric our loss function becomes 

n n p 

Cr(A) = ^ W ij( X ij — a isdjs) ■ 

1=1j =1 S=1 

Now choose c^- = UiUj < Wij , and basically the same majorization scheme can be applied. The 
matrix approximation problem in each iteration is now solved by using the first p non-negative 
eigenvalues of the current target (Keller (1962), De Leenw (1974)). 


2.3 Some History 

From an historical point of view the most important special case is that in which w^j is either 
one or zero, which zero indicating missing data. It was analyzed by Thomson (1934) in the 
case of factor analysis, using basically the same alternating least squares algorithm used by 
Yayes (1933) in factorial experiments. 

We derive this algorithm from our majorization perspective by choosing c i3 = 1, no matter if 
= 1 or = 0. Thus 


( k ) / ~(fc)\ 

r\j = Wij{Xij - x\j’) 


0 if w^ = 0, 

Xjj - x)j' if w^ = 1, 


and 


/,( fc ) _ a(*0 i r ( k ) _ 
,L ij ^ ij ' ' ij 


X 


w 

y 


x. 


y 


if w^ = 0, 
if w^ = 1. 


This is exactly the Thomson-Yates method for dealing with data that are missing, either by 
design, by accident, or structurally from properties of the model. The equivalence of the 
alternating least squares and majorization methods in this case was already observed by 
Kiers (1997). 


Kiers (1997), following some suggestions by Heiser (1995), proposes to use c VJ = 
max" =1 max™ =1 w^. Thus all elements of c tl are the same, say equal to c, and 


}A k ) _ y( fc ) , lr 'J ( _ r 

'Hj —' ' h ij ' v-^y 




This is improved by Groenen, Giaquinto, and Kiers (2003) by using c^- = maxjb 1 
means that we can write 


; 7 ( fc ) _ y( fc ) , ' U ^i( r _ T 

'Hj — ,h ij ' v-^y - h i] 

Ci 


(k)s 


W 


y i 


which 


Both Kiers (1997) and Groenen, Giaquinto, and Kiers (2003) compare their majorization 
algorithms with the alternating least squares method of Gabriel and Zamir (1979). The 
conclusion seems to be that the algorithm of Kiers (1997), which uses a scalar bound for 
the w^ is slower than criss-cross regression, while the algorithm of Groenen, Giaquinto, and 
Kiers (2003), which uses a row-wise bound, is actually faster than criss-cross regression. 
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It is clear that both Kiers (1997) and Groencn, Giaquinto, and Kiers (2003) use c i3 of the 
form Cij = U{Vj, with different choices of u and v. We will now look at some general theory 
for weight majorizations of this form, and see if we can come up with possible improvements. 

2.4 Rate of Convergence 

The algorithmic map A : M nxm — y M nxm j n the case that C = uv' is defined as the composition 
of three maps. 

• The first is the affine map Q(Z) := U^(Z + F * (. X — Z)) W, where * is used for the 
Hadamard product, and F has elements f %1 = w l3 /c y -. The diagonal matrices U and V 
have the elements of u and v on the diagonal. 

• The second map T p non-linearly projects its argument, using the Frobenius norm, on 
the set of all matrices of rank less than or equal to p. 

• And finally we have the linear map S(Z) := U ~2 ZV~ 2 . 

Combining these three maps, all from M nxm to M nxm , 

A(Z) = S{T P (6(Z))). 

The asymptotic error constant of the iterative majorization procedure is the spectral radius 
of the derivative of the algorithmic map A. So we now proceed to give an expression for that 
derivative. By the chain rule 

A(z + A) = a(z) + s(vr p (g(z))n( A)) + ?(||A||), 

where ?i( A) = U^(E -k A)V^ and E is the matrix with elements 1 — Wij/cij. 

The derivatives of T p in the rectangular case are computed in De Leeuw (2008) (the symmetric 
case is in De Leeuw (2016b)). From De Leeuw (2008), using the representation r p (Q(Z)) = 
Q(Z)L p L' p , with L p the normalized eigenvectors corresponding with the p largest eigenvalues 

of g(zyg(z), 


Vr„(G(Z))H(A) = M(A)L r L’ p - S(Z){E p (Z, A) + %(Z, A)}, 

where 

E„(Z, A) = f(g(Z)'S(Z) - Ap)+(g(Z)'H(A) + H(A )'S(Z))l,C 

5=1 

This allows us to compute the partial derivatives by using unit matrices (one element one, the 
others zero) for A. Directional derivatives for general A are implemented in lrPerturbO and 
the matrix of partials in computed in lrDeriveO, both functions are in the file lowrank.R. 

3 One-sided Approximation 

As a consequence of our previous discussions it makes sense to look for u > 0 and v > 0 such 
that U{Vj > Wij for all i,j, but, given the constraints, the matrix with elements UiVj is as 
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small as possible. There are many ways in which we can formalize “as small as possible”. 
In De Leeuw (2017) convex analysis and duality were used in a different and more general 
context. In this paper we use the notion of approximation from above, i.e. we minimize some 
norm \\W — uv'\\ over uv' > W and u,v > 0. 

In order to make this a convex problem we take logaritms. Thus we minimize || log W — (log u + 
log v) || over log u t + log Vj > log Wij for all i and j. One-sided discrete linear approximation for 
general £ p norms is discussed in Watson (1973). We will use least squares on the logarithms, 
which leads to a quadratic programming problem. But l\ and are both interesting as well, 
and lead to linear programming problems. 

Using logarithms only makes sense if uy,- > 0. For = 0 there is an obvious way to 
proceed, by just skipping those elements. The tq and Vj computed over the non-zero vj %] will 
automatically satisfy rqtq > = 0 for the zero weights as well. 

Suppose M. is the set of pairs (i,j) such that > O.To solve minimization of 

J2 (l°g W ij - l °SUi - log Vj ) 2 
(i,j)eM 

over over logtq + logir,- > log w tJ we use the \lsi() function from the lsei package (Wang, 
Lawson, and Hanson (2015)). 


4 Example 

The data are crash data on New Zealand roads, from the VGAM package (Yee (2010)). Data 
are cross-classified by time (hour of the day) and day of the week, accumulated over 2009. 
We use crashi, which gives the number of injuries by caused by cars. 


## 


Mon 

Tue 

Wed 

Thu 

Fri 

Sat 

Sun 

## 

0 

16 

10 

22 

12 

29 

55 

55 

## 

1 

13 

11 

15 

23 

23 

42 

64 

## 

2 

5 

8 

16 

13 

24 

37 

64 

## 

3 

6 

4 

8 

12 

19 

31 

45 

## 

4 

7 

6 

11 

16 

11 

35 

35 

## 

5 

12 

14 

14 

18 

19 

27 

35 

## 

6 

37 

37 

32 

45 

32 

21 

36 

## 

7 

66 

79 

92 

75 

73 

40 

33 

## 

8 

117 

138 

132 

138 

122 

59 

36 

## 

9 

67 

81 

68 

75 

72 

59 

45 

## 

10 

67 

70 

62 

76 

72 

84 

57 

## 

11 

80 

80 

50 

80 

74 

114 

86 

## 

12 

75 

85 

86 

87 

94 

101 

93 

## 

13 

77 

69 

84 

90 

90 

105 

80 

## 

14 

84 

87 

98 

85 

104 

103 

96 

## 

15 

112 

136 

134 

156 

158 

120 

103 

## 

16 

115 

110 

138 

144 

146 

106 

90 
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## 

17 

127 

130 

140 

149 

155 

104 

97 

## 

18 

63 

69 

91 

97 

142 

83 

64 

## 

19 

47 

63 

53 

57 

67 

69 

52 

## 

20 

25 

46 

62 

55 

68 

70 

44 

## 

21 

34 

42 

49 

53 

85 

62 

33 

## 

22 

24 

26 

35 

52 

67 

54 

18 

## 

23 

28 

23 

20 

49 

61 

69 

29 


We pretend the data are independent Poisson counts, and we choose weights equal to . This 
makes the minimum of the loss function a chi-square random variable, with the appropriate 
number of degrees of freedom. The optimal rank-one approximation from above is computed 
wth the program mboundO from the appendix. The optimal uv' is plotted against w in figure 
1 and the optimal u and v are plotted in figure 2. 



w 


Figure 1: Best Rank-1 Approximation Weights 
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hour 

Figure 2: Components of One-dimensional Bounds 


day 


The wPCAO function has a bnd parameter, which can be equal to all, or col, or row, or opt. 
If bnd="all" then c tJ = max w t j , as in Kiers (1997), if bnd="row" then Cij = max^u^-, as in 


Groenen, Giaquinto, and Kiers (2003). 
then we compute the optimal UiVj. 


If bnd="col" then Qy = max, vj i3 and if bnd="opt" 


First look at p = 1. For all four values of the bnd parameter we find a chi-square equal to 
709.9526292976 with 168 — ((24 + 7) — 1) = 138 degrees of freedom. The iteration count for 
bnd="all", "col", "row", "opt" is, respectively, 208, 151, 21, 17. Because in this example 
the between-row/within-column variation is so much larger than the within-row/between- 
column variation, it makes sense that bnd="row" performs well, almost as good as bnd="opt", 
while bnd="col" is bad, almost as bad as bnd="all". The one-dimensional plots of A and 
B from X = ab' are in figure 3. 






hour 


day 


Figure 3: Components of One-dimensional Matrix Approximation 

We also computed the spectral norms of the derivative at the solution, using the functions 
in lowrank.R. For bnd="all", "col", "row", "opt" the convergence rate is, respectively, 
0.9710924907, 0.955475149, 0.660381091, 0.6152936489. 


For p = 2 for all four values of the bnd parameter we find chi-square equal to 215.349822881 
with 168 — ((24 + 7) * 2 — 4) = 110 degrees of freedom. The iteration count for 
bnd="all", "col", "row" , "opt" is, respectively, 164, 99, 46, 35. 




dim 1 


dim 1 
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Figure 4: Components of Two-dimensional Matrix Approximation 

For bnd="all", "col", "row", "opt" the convergence rate is, respectively, 0.9715807406, 
0.961724624, 0.9042846128, 0.8856193743. 

5 Discussion 

The main question, which we did not answer, is if the decrease in computer time by decreasing 
the number of iterations is offset by the extra computation of the optimal rank-one bounds. 
Computing the optimal bounds is a quadratic programming problem of size nm, which can be 
substantial and rapidly becomes impractical for really large data sets. In that case, alternative 
strategies are needed, for example starting with row or column bounds and performing only 
a few iterations of an iterative quadratic programming method. Of course we also have not 
tried l\ and loss functions to define the optimal weights, which have linear programming 
solutions that are less expensive than the quadratic ones. 

Also note that in each majorization step we have solved a low-rank approximation problem. 
We also have a convergent algorithm if we replaced the singular value decomposition in each 
majorization step by one or more alternating least squares iterations. This will generally 
lead to more majorization iterations, but each majorization iteration becomes less expensive. 
In addition we can use A and B from X = AB' for the previous majorization iteration as 
starting points for the next one. 

There is an interesting recent paper by Szlarn, Tulloch, and Tygert (2017), who show that 
just a few alternating least squares iterations will generally provide a very good low-rank 
approximation. This implies that it may not pay off to optimize the number of iterations 
needed for convergence, if convergence is not really of much practical value anyway. It also 
indicates that performing only a small number of “inner” alternating least squares iterations in 
each majorization step is probably a good strategy. Again, that is something to be explored. 

6 Appendix: Code 

6.1 mbound.R 

mbound <- function (w) { 
n <- nrow (w) 
m <- ncol (w) 
ww <- log (w) 
gl <- matrix (0, n * m, n) 
g2 <- matrix (0, n * m, m) 
k <- 1 

for (j in l:m) { 
for (i in l:n) { 
gl[k, i] <- 1 
g2[k, j] <- 1 
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k <- k + 1 


> 

} 

g <- cbind (gl, g2) 
f <- log (as.vector (w)) 
h <- lsi (g, f, g, f) 
u <- exp (h[l :n]) 
v <- exp (h[n + (l:m)]) 
return (list (u = u, v = v)) 


6.2 wPCA.R 

gfunc <- function (z, x, u, v, w) { 

return ((z + (w / outer (u, v)) * (x - z)) * sqrt(outer(u,v))) 

> 

sfunc <- function (z, u, v) { 
return (z / sqrt (outer (u, v))) 

> 

lfunc <- function (x, p) { 

1 <- as.matrix(eigen (crossprod(x) ) $vectors [,1:p]) 
return (x °/ 0 *7 0 tcrossprod (1)) 

> 

wPCA <- 

function (x, 
w, 

p = 2, 

bnd = "opt", 
itmax = 1000, 
eps = le-6, 
verbose = FALSE) { 
n <- nrow (x) 
m <- ncol (x) 
if (bnd == "all") { 
u <- rep(l, n) 
v <- rep (max (w), m) 

> 

if (bnd == "row") { 
u <- apply (w, 1, max) 
v <- rep (1, m) 

> 
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if (bnd == "col") { 
u <- rep (1, n) 
v <- apply (w, 2, max) 

> 

if (bnd == "opt") { 
uv <- mbound (w) 

U <- uv$u 
V <- uv$v 

> 

zold <- lfunc (x, p) 

fold <- sum (w * (x - zold) ~ 2) 

itel <- 1 

repeat { 

zl <- gfunc (zold, x, u, v, w) 
z2 <- lfunc (zl, p) 
znew <- sfunc (z2, u, v) 
fnew <- sum (w * (x - znew) ~ 2) 
if (verbose) { 
cat ( 

"iteration ", 
formate ( 
itel, 

digits = 0, 
width = 3, 
format = "d" 

), 

" fold", 
formate ( 
fold, 

digits = 6, 
width = 10, 
format = "f" 

), 

" fnew" , 
formate ( 
fnew, 

digits = 6, 
width = 10, 
format = "f" 

), 

"\n" 

) 

} 

if (((fold - fnew) < eps) || (itel == itmax)) 
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break 

itel <- itel + 1 


zold <- znew 
fold <- fnew 

> 

return (list (z = znew, f = fnew, itel = itel, u 


u, v = v)) 


6.3 lowrank.R 

gfunc <- function (z, x, u, v, w) { 
return ((z + (w / outer (u, v)) * (x - z)) * sqrt(outer (u,v))) 

> 

sfunc <- function (z, u, v) { 
return (z / sqrt (outer (u, v))) 

> 

lfunc <- function (x, p) { 

1 <- as.matrix(eigen (crossprod(x) ) $vectors [, 1: p]) 
yy <- x tcrossprod (1) 
return (yy) 

} 

ffunc <- function (z, x, p, u, v, w) { 
zl <- gfunc (z, x, u, v, w) 
z2 <- lfunc (zl, p) 
z3 <- sfunc (z2, u, v) 
return (as . vector(z3) ) 

} 

cfunc <- function (z, x, p, u, v, w) { 
n <- length (u) 
m <- length (v) 
z <- matrix (z, n, m) 
return (ffunc (z, x, p, u, v, w)) 

> 

hfunc <- function (delta, u, v, w) { 

((1 - w / outer (u, v)) * delta) * sqrt (outer (u, v)) 

> 

lrPerturb <- function(z, delta, x, p, u, v, w) { 
m <- ncol (x) 
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gz <- gfunc (z, x, u, v, w) 

hd <- hfunc (delta, u, v, w) 

gh <- crossprod(gz , hd) + crossprod(hd, gz) 

hp <- matrix (0, m, m) 

e <- eigen(crossprod(gz) ) 

It <- e$vectors 
vt <- e$values 
lp <- as.matrix(lt [, 1: p]) 
for (s in l:p) { 

vi <- ifelse (vt == vt [s], 0, 1 / (vt - vt[s])) 

cinv <- It 1*1 diag (vi) 1*1 t(lt) 

hp <- hp + cinv 1*1 gh 1*1 outer (lp[, s] , lp[, s]) 

} 

dz <- hd 1*1 tcrossprod(lp) - gz 1*1 (hp + t(hp)) 
return (sfunc (dz, u, v)) 

} 

lrDerive <- function(z, x, p, u, v, w) { 
n <- nrow(z) 
m <- ncol(z) 
k <- 1 

dz <- matrix (0, n * m, n * m) 
for (j in l:m) 
for (i in l:n) { 

dx <- matrix(0, n, m) 
dx[i, j] <- 1 

dz[, k] <- as.vector(lrPerturb(z, dx, x, p, u, v, w)) 
k <- k + 1 

> 

return(dz) 
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